Alignment-based metrics

The definition of adequate metrics between the objects to be compared is at the core of many machine learning methods (e.g., nearest neighbors, kernel machines). When complex objects such as time series are involved, such metrics have to be carefully designed to capture the desired notions of similarity.

Let us illustrate our point with an example.

$k$-means clustering with the Euclidean distance. Each subfigure represents series from a given cluster and their centroid (in red).

The figure above is the result of a $k$-means clustering that uses the Euclidean distance as a base metric. One issue with this metric is that it is not invariant to time shifts, whereas the dataset at stake clearly calls for such an invariance.
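To see why the Euclidean distance fails here, consider a series and a time-shifted copy of itself. A minimal numpy sketch (the series values are illustrative, not taken from the dataset above):

```python
import numpy as np

# A smooth periodic series and a copy of it shifted by 5 time steps
# (illustrative values, for demonstration only)
t = np.arange(50)
x = np.sin(2 * np.pi * t / 25.0)
x_shifted = np.sin(2 * np.pi * (t - 5) / 25.0)

# The Euclidean distance compares timestamps one by one,
# so a pure time shift yields a large distance
euclidean = np.sqrt(np.sum((x - x_shifted) ** 2))
print(euclidean)
```

Despite the two series having identical shapes, the resulting distance is far from zero.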

When using a shift-invariant similarity measure (discussed in our {ref}`sec:dtw` section) at the core of $k$-means, one gets:

$k$-means clustering with Dynamic Time Warping. Each subfigure represents series from a given cluster and their centroid (in red).

This part of the course tackles the definition of adequate similarity measures for time series and their use at the core of machine learning methods.

Dynamic Time Warping

This section covers works related to Dynamic Time Warping for time series.

Dynamic Time Warping (DTW) (Sakoe and Chiba 1978) is a similarity measure between time series. Consider two time series $\mathbf{x}$ and $\mathbf{x}^\prime$ of respective lengths $n$ and $m$. Here, all elements $x_i$ and $x^\prime_j$ are assumed to lie in the same $p$-dimensional space, and the exact timestamps at which observations occur are disregarded: only their ordering matters.

Optimization Problem

In the following, a path $\pi$ of length $K$ is a sequence of $K$ index pairs $\left((i_0, j_0), \dots, (i_{K-1}, j_{K-1})\right)$.

DTW between $\mathbf{x}$ and $\mathbf{x}^\prime$ is formulated as the following optimization problem:

\begin{equation}
DTW_q(\mathbf{x}, \mathbf{x}^\prime) = \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)} \left( \sum_{(i, j) \in \pi} d(x_i, x^\prime_j)^q \right)^{\frac{1}{q}}
\label{eq:dtw}
\end{equation}

where $\mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)$ is the set of all admissible paths, i.e. the set of paths $\pi$ such that:

- the beginning and end of the series are matched: $\pi_0 = (0, 0)$ and $\pi_{K-1} = (n-1, m-1)$;
- the path is monotone and continuous: for all $k$, $i_{k+1} - i_k \in \{0, 1\}$, $j_{k+1} - j_k \in \{0, 1\}$, and $(i_{k+1} - i_k) + (j_{k+1} - j_k) \geq 1$.

In what follows, we will denote $DTW_2$ as DTW. In this context, a path can be seen as a temporal alignment of the time series, and the optimal path is such that the Euclidean distance between the aligned (i.e. resampled) time series is minimal.
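For very short series, the optimization problem above can even be solved by brute force, enumerating all admissible paths. A sketch assuming the standard admissibility conditions (paths start at $(0, 0)$, end at $(n-1, m-1)$, and advance by unit monotone steps); the series values are illustrative:

```python
import numpy as np

def admissible_paths(n, m):
    """Recursively enumerate all admissible paths from (0, 0) to (n-1, m-1)."""
    def extend(path):
        i, j = path[-1]
        if (i, j) == (n - 1, m - 1):
            yield path
            return
        # Allowed moves: advance i, advance j, or advance both
        for di, dj in [(1, 0), (0, 1), (1, 1)]:
            if i + di < n and j + dj < m:
                yield from extend(path + [(i + di, j + dj)])
    return list(extend([(0, 0)]))

x = np.array([0.0, 1.0, 2.0])
x_prime = np.array([0.0, 2.0])

# DTW_2 by exhaustive search: minimum path cost over all admissible paths
costs = [
    sum((x[i] - x_prime[j]) ** 2 for i, j in path) ** 0.5
    for path in admissible_paths(len(x), len(x_prime))
]
print(min(costs))
```

The number of admissible paths grows combinatorially with the series lengths, which is why the dynamic programming algorithm below is used in practice.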

The following image exhibits the DTW path (in white) for a given pair of time series, on top of the cross-similarity matrix that stores the $d(x_i, x^\prime_j)$ values:
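The cross-similarity matrix itself is cheap to compute. A minimal numpy sketch for univariate series, taking $d$ as the absolute difference (the values are illustrative):

```python
import numpy as np

# Two illustrative univariate series
x = np.array([0.0, 1.0, 2.0, 1.0])
x_prime = np.array([0.0, 2.0, 1.0])

# Cross-similarity matrix D[i, j] = d(x_i, x'_j), built by broadcasting
D = np.abs(x[:, None] - x_prime[None, :])  # shape (n, m)
print(D)
```

For $p$-dimensional series, the same matrix can be obtained with pairwise Euclidean distances instead of absolute differences.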

Algorithmic Solution

There exists an $O(mn)$ algorithm to compute the exact optimum of this problem (assuming the computation of $d(\cdot, \cdot)$ is $O(1)$):

  
import numpy as np

def d(xi, xj):
    # Ground metric between observations (Euclidean distance here);
    # any O(1) distance would do
    return np.linalg.norm(np.atleast_1d(xi) - np.atleast_1d(xj))

def dtw(x, x_prime, q=2):
    n, m = len(x), len(x_prime)
    gamma = np.zeros((n, m))  # gamma[i, j]: DTW_q cost (at power q) up to (i, j)
    for i in range(n):
        for j in range(m):
            dist = d(x[i], x_prime[j]) ** q
            if i == 0 and j == 0:
                gamma[i, j] = dist
            else:
                gamma[i, j] = dist + min(
                    gamma[i - 1, j] if i > 0 else np.inf,
                    gamma[i, j - 1] if j > 0 else np.inf,
                    gamma[i - 1, j - 1] if (i > 0 and j > 0) else np.inf)
    return gamma[n - 1, m - 1] ** (1.0 / q)
  

The basic idea behind this algorithm is that there exists a recurrence relationship between partial DTW computations. More precisely, if we denote by $\gamma_{i,j}$ the $DTW_q$ similarity (at power $q$) between sequences $\mathbf{x}_{\rightarrow i}$ and $\mathbf{x}^\prime_{\rightarrow j}$ (where the notation $\mathbf{x}_{\rightarrow i}$ denotes the sequence $\mathbf{x}$ observed up to time $i$), then we have:

\begin{aligned}
\gamma_{i,j} &= DTW_q(\mathbf{x}_{\rightarrow i}, \mathbf{x}^\prime_{\rightarrow j})^q \\
&= \min_{\pi \in \mathcal{A}(\mathbf{x}_{\rightarrow i}, \mathbf{x}^\prime_{\rightarrow j})} \sum_{(k, l) \in \pi} d(x_k, x^\prime_l)^q \\
&\stackrel{*}{=} d(x_i, x^\prime_j)^q + \min_{\pi \in \mathcal{A}(\mathbf{x}_{\rightarrow i}, \mathbf{x}^\prime_{\rightarrow j})} \sum_{(k, l) \in \pi[:-1]} d(x_k, x^\prime_l)^q \\
&\stackrel{**}{=} d(x_i, x^\prime_j)^q + \min(\gamma_{i-1, j}, \gamma_{i, j-1}, \gamma_{i-1, j-1})
\end{aligned}

and $DTW_q(\mathbf{x}, \mathbf{x}^\prime)$ is then $(\gamma_{n, m})^{\frac{1}{q}}$. In more detail:

- $\stackrel{*}{=}$ holds because any admissible path between $\mathbf{x}_{\rightarrow i}$ and $\mathbf{x}^\prime_{\rightarrow j}$ has to end with the pair $(i, j)$;
- $\stackrel{**}{=}$ holds because, once this last pair is removed, what remains is an admissible path ending at one of $(i-1, j)$, $(i, j-1)$ or $(i-1, j-1)$.

The dynamic programming algorithm presented above relies on this recurrence formula and stores intermediate computations for efficiency.
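Beyond the final cost, the stored $\gamma$ matrix also allows recovering the optimal path itself, by backtracking from the last cell towards the first. A self-contained sketch (with a padded matrix to avoid boundary tests, and $d$ taken as the absolute difference for univariate series):

```python
import numpy as np

def dtw_path(x, x_prime, q=2):
    """Fill the gamma matrix, then backtrack to recover an optimal path."""
    n, m = len(x), len(x_prime)
    # Padded matrix: row 0 and column 0 act as +inf boundaries
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(x[i - 1] - x_prime[j - 1]) ** q
            gamma[i, j] = dist + min(gamma[i - 1, j],
                                     gamma[i, j - 1],
                                     gamma[i - 1, j - 1])
    # Backtrack: at each cell, move to the predecessor of minimal cost
    path = [(n - 1, m - 1)]
    i, j = n, m
    while (i, j) != (1, 1):
        candidates = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        i, j = min(candidates, key=lambda c: gamma[c])
        path.append((i - 1, j - 1))  # back to 0-based series indices
    return path[::-1], gamma[n, m] ** (1.0 / q)

path, cost = dtw_path([0.0, 1.0, 2.0], [0.0, 2.0])
print(path, cost)
```

This is the kind of procedure used to draw the white alignment path in the figure above.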

Dot product notation

Dynamic Time Warping can also be formalized using the following dot-product notation:

\begin{equation*}
DTW_q(\mathbf{x}, \mathbf{x}^\prime) = \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)} \langle A_\pi, D_q(\mathbf{x}, \mathbf{x}^\prime) \rangle^{\frac{1}{q}}
\end{equation*}

where $D_q(\mathbf{x}, \mathbf{x}^\prime)$ stores the distances $d(x_i, x^\prime_j)$ at power $q$ and

\begin{equation}
(A_\pi)_{i,j} = \left\{ \begin{array}{rl} 1 & \text{ if } (i, j) \in \pi \\ 0 & \text{ otherwise} \end{array} \right.
\end{equation}
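This notation can be checked on a toy example: building $A_\pi$ for one admissible path and taking the entrywise (Frobenius) inner product with $D_q$ recovers that path's cost. A sketch with illustrative values:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
x_prime = np.array([0.0, 2.0])
q = 2

# Distance matrix at power q (absolute difference as ground metric)
D_q = np.abs(x[:, None] - x_prime[None, :]) ** q

# Binary matrix encoding one admissible path
pi = [(0, 0), (1, 1), (2, 1)]
A_pi = np.zeros_like(D_q)
for i, j in pi:
    A_pi[i, j] = 1.0

# Frobenius inner product <A_pi, D_q>: this path's cost at power q
cost_q = np.sum(A_pi * D_q)
print(cost_q ** (1.0 / q))
```

Minimizing this inner product over all admissible $A_\pi$ matrices is exactly the optimization problem of the previous section.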

Properties

Dynamic Time Warping has the following properties:

Invariance to time shifts
Contrary to the Euclidean distance, DTW is invariant to time shifts between series.

However, mathematically speaking, DTW is not a valid metric, since it satisfies neither the triangle inequality nor the identity of indiscernibles.
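The failure of the identity of indiscernibles is easy to exhibit: two distinct series can be aligned at zero cost. Since DTW is a minimum over admissible paths and is non-negative, a single zero-cost path suffices (a small counterexample with illustrative values):

```python
import numpy as np

# x and x' are distinct series, yet DTW(x, x') = 0:
# the admissible path below aligns equal values only
x = np.array([0.0, 1.0, 1.0])
x_prime = np.array([0.0, 1.0])

path = [(0, 0), (1, 1), (2, 1)]
cost = sum((x[i] - x_prime[j]) ** 2 for i, j in path) ** 0.5
print(cost)  # this path has zero cost, hence DTW(x, x') = 0 although x != x'
```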

Setting Additional Constraints

The set of temporal deformations to which DTW is invariant can be reduced by imposing additional constraints on the set of admissible paths. Such constraints typically force paths to stay close to the diagonal.

The Sakoe-Chiba band is parametrized by a radius $r$ (the number of off-diagonal elements to consider, sometimes called the warping window size), as illustrated below:
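A band constraint only changes which cells of the $\gamma$ matrix may be visited. A sketch of a Sakoe-Chiba variant of the dynamic program above (self-contained, univariate, with $d$ the absolute difference):

```python
import numpy as np

def dtw_sakoe_chiba(x, x_prime, r, q=2):
    """DTW restricted to cells within a Sakoe-Chiba band of radius r."""
    n, m = len(x), len(x_prime)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) > r:
                continue  # cell outside the band: left at +inf
            dist = abs(x[i - 1] - x_prime[j - 1]) ** q
            gamma[i, j] = dist + min(gamma[i - 1, j],
                                     gamma[i, j - 1],
                                     gamma[i - 1, j - 1])
    return gamma[n, m] ** (1.0 / q)

# With r = 0 and equal-length series, only the diagonal is allowed,
# which recovers the Euclidean distance; larger r relaxes the constraint
print(dtw_sakoe_chiba([0.0, 1.0, 2.0], [0.0, 2.0, 1.0], r=0))
print(dtw_sakoe_chiba([0.0, 1.0, 2.0], [0.0, 2.0, 1.0], r=1))
```

Note that widening the band can only decrease (or preserve) the resulting cost, since it enlarges the set of admissible paths.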

The Itakura parallelogram sets a maximum slope $s$ for alignment paths, which leads to a parallelogram-shaped constraint:

References

Sakoe, Hiroaki, and Seibi Chiba. 1978. “Dynamic Programming Algorithm Optimization for Spoken Word Recognition.” IEEE Transactions on Acoustics, Speech and Signal Processing 26 (1): 43–49.